Systems and Methods for Scalable Composition of Media Streams for Real-Time Multimedia Communication

ABSTRACT

A new approach is proposed that contemplates systems and methods to support the operation of a Virtual Media Room or Virtual Meeting Room (VMR), wherein each VMR can accept from a plurality of participants at different geographic locations a variety of video conferencing feeds from video conference endpoints that can be either proprietary or standards-based and enable a multi-party video conferencing session among the plurality of participants by composing one composite audio and video stream for each of the participants. Each single VMR can be implemented across an infrastructure of a globally distributed set of servers/media processing nodes co-located in Points of Presence (POPs) for Internet access. Each VMR also gives its users a rich set of conferencing and collaboration interactions hitherto not experienced by video conferencing users.

RELATED APPLICATIONS

This application claims priority to U.S. Provisional Patent Application No. 61/334,043, filed May 12, 2010, and entitled “Systems and methods for virtualized video conferencing across multiple standard and proprietary standards,” which is hereby incorporated herein by reference.

This application claims priority to U.S. Provisional Patent Application No. 61/334,045, filed May 12, 2010, and entitled “Systems and methods for virtual media room for video conferencing,” which is hereby incorporated herein by reference.

This application claims priority to U.S. Provisional Patent Application No. 61/334,050, filed May 12, 2010, and entitled “Systems and methods for distributed global infrastructure to support virtualized video conferencing,” which is hereby incorporated herein by reference.

This application claims priority to U.S. Provisional Patent Application No. 61/334,054, filed May 12, 2010, and entitled “Systems and methods for customized user experience to virtualized video conferencing,” which is hereby incorporated herein by reference.

BACKGROUND

Videoconferencing in the enterprise has seen rapid growth in recent years as businesses have become global and employees interact with a large ecosystem of remote employees, partners, vendors and customers. At the same time, the availability of cheap software solutions in the consumer sector and the widespread availability of video cameras in laptop and mobile devices have fueled wide adoption of video chat to stay connected with family and friends.

However, the landscape of options available for videoconferencing remains fragmented into isolated islands that cannot communicate well with each other. Within the enterprise there are hardware-based conference rooms equipped with videoconferencing systems from vendors such as Polycom, Tandberg and LifeSize, and high-end Telepresence systems popularized by vendors such as Cisco. At the lower end of the price spectrum are software-based enterprise videoconferencing applications such as Microsoft's Lync as well as consumer video chat applications such as Skype, GoogleTalk and Apple's FaceTime.

There are significant trade-offs between price, quality and reach when choosing to use any of the above systems for a video call. Large corporations invest hundreds of thousands of dollars in their Telepresence systems to achieve low-latency, high-definition calls but can only reach the small subset of people that have access to similar systems. Medium-sized businesses invest tens of thousands of dollars in their hardware-based systems to achieve up to 720p High Definition (HD) quality. They buy hardware Multipoint Control Units (MCUs) worth hundreds of thousands of dollars with a fixed number of “ports” and use these to communicate between their different branch offices, but are at a loss when it comes to communicating easily with systems outside their company. Companies that cannot afford these settle for lower-quality, best-effort experiences using clients such as Skype, but on the flip side are able to easily connect with others, whether inside or outside their own companies. Average users find these trade-offs in videoconferencing too complicated to understand compared to audio calls using mobile or landline telephones that “just work” without them thinking about any of these trade-offs. As a result there is low adoption of videoconferencing in business even though the technology is easily available and affordable to most people.

Today, more than ever before, there is a need for a service that removes this trade-off and provides a high-quality video call at almost the price of an audio call without the user having to think about complicated trade-offs. Such a service would connect disparate hardware and software videoconferencing and chat systems from different vendors that speak different protocols (H.323, SIP, XMPP, proprietary) and use different video and audio codecs, and have them talk to each other. It would offer very low latency and a much better viewing experience than current solutions. It would be hosted in the Internet/cloud, thereby removing the need for complicated equipment with significant capital and operating investment within the enterprise. Ease of use would be as simple as setting up an audio conference call, without the need for complicated provisioning arrangements from corporate IT.

The foregoing examples of the related art and limitations related therewith are intended to be illustrative and not exclusive. Other limitations of the related art will become apparent upon a reading of the specification and a study of the drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 depicts an example of a system to support operations of a virtual meeting room (VMR) across multiple standard and proprietary video conference systems.

FIG. 2 depicts a flowchart of an example of a process to support operations of a VMR across multiple standard and proprietary video conference systems.

FIG. 3 depicts an example of various components of a media processing node.

FIGS. 4(a)-(b) depict diagrams of examples of media encoding under a simple dyadic scenario.

FIG. 5 depicts an example of a diagram for highly scalable audio mixing and sampling.

FIG. 6 depicts an example of an Internet/cloud-based audio acoustic echo canceller.

FIGS. 7(a)-(c) depict an example of a multi-phased media stream distribution process achieved locally on a LAN present in each POP or across multiple POPs on the WAN.

FIG. 8 depicts examples of software components of the global infrastructure engine to support the VMR.

FIG. 9 depicts an example illustrating a high-level mechanism for fault-tolerant protocol handling to prevent improper input from causing instability and possible security breaches.

FIG. 10 depicts an example illustrating techniques for firewall traversal.

FIG. 11 depicts an example of a diagram to manage and control the video conference.

FIG. 12 depicts an example of a diagram for high-quality event sharing using MCUs of the global infrastructure engine.

FIG. 13 depicts an example of one way in which a laptop or mobile phone could be associated with a conference room system.

FIG. 14 depicts an example of a diagram for providing welcome screen content to the participants.

FIG. 15 depicts an example of a diagram for personalizable videoconference rooms on a per-call basis.

FIG. 16 depicts an example of a single online “home” for personalized sharing of one's desktop, laptop and/or mobile screen.

FIG. 17 depicts an example of a one-click video conference call plug-in via a mailbox.

FIG. 18 depicts an example of a diagram for delivering a virtual reality experience to the participants.

FIG. 19 depicts an example of a diagram for augmented-reality user interaction services to the participants.

DETAILED DESCRIPTION OF EMBODIMENTS

The approach is illustrated by way of example and not by way of limitation in the figures of the accompanying drawings, in which like references indicate similar elements. It should be noted that references to “an” or “one” or “some” embodiment(s) in this disclosure are not necessarily to the same embodiment, and such references mean at least one.

A new approach is proposed that contemplates systems and methods to support the operation of a Virtual Media Room or Virtual Meeting Room (VMR), wherein each VMR can accept from a plurality of participants at different geographic locations a variety of video conferencing feeds of audio, video, presentation and other media streams from video conference endpoints and other multimedia-enabled devices that can be either proprietary or standards-based and enable a multi-party or point-to-point video conferencing session among the plurality of participants. For non-limiting examples, video feeds from proprietary video conference endpoints include but are not limited to, Skype, while video feeds from standards-based video conference endpoints include but are not limited to, H.323 and SIP. Each single VMR can be implemented and supported across an infrastructure of a globally distributed set of commodity servers acting as media processing nodes co-located in Points of Presence (POPs) for Internet access, wherein such massively distributed architecture can support thousands of simultaneously active VMRs in a reservation-less manner and yet is transparent to the users of the VMRs. Each VMR gives its users a rich set of conferencing and collaboration interactions hitherto not experienced by video conferencing participants. These interactions encompass controlling of a video conferencing session, its configuration, the visual layout of the conferencing participants, customization of the VMR and adaptation of the room to different vertical applications. For a non-limiting example, one such use of the VMR is to facilitate point-to-point calls between two disparate endpoints such as a Skype client and a standards-based H.323 endpoint, wherein the Skype user initiates a call to another user with no knowledge of the other user's endpoint technology and a VMR is automatically provisioned between the two parties after determining the translation needed between the two endpoints.

The approach further utilizes virtual reality and augmented-reality techniques to transform the video and audio streams from the participants in various customizable ways to achieve a rich set of user experiences. A globally distributed infrastructure supports the sharing of the event among the participants at geographically distributed locations through a plurality of MCUs (Multipoint Control Units), each configured to process the plurality of audio and video streams from the plurality of video conference endpoints in real time.

Compared to conventional video conferencing systems that require every participant to the video conference to follow the same communication standard or protocol, a VMR allows the users/participants of a video conference to participate in a multi-party or point-to-point video conferencing session in a device- and protocol-independent fashion. By conducting manipulation of the video and audio streams transparently in the Internet/cloud without end user involvement, the proposed approach brings together video conference systems of different devices and protocols of video conferencing and video chat that exist in the world today as one integrated system.

Hosting the VMR in the Internet/cloud allows the participants to initiate a call to anyone and have the VMR ring them at all their registered endpoint devices, with the callee transparently picking up the call from any endpoint device that they wish. A VMR hosted in the Internet/cloud enables any participant to upload media content to the cloud and have it be retransmitted to other participants in formats of their choice, with or without modifications.

FIG. 1 depicts an example of a system 100 to support operations of a VMR across multiple standard and proprietary video conference systems. Although the diagrams depict components as functionally separate, such depiction is merely for illustrative purposes. It will be apparent that the components portrayed in this figure can be arbitrarily combined or divided into separate software, firmware and/or hardware components. Furthermore, it will also be apparent that such components, regardless of how they are combined or divided, can execute on the same host or multiple hosts or several virtualized instances on one or more hosts, wherein the multiple hosts can be connected by one or more networks geographically distributed anywhere in the world.

In the example of FIG. 1, the system 100 includes at least a VMR engine 102 that operates the VMRs, a global infrastructure engine 104 that supports the operations of the VMRs, and a user experience engine 106 that enhances the users' experience of the VMRs.

As used herein, the term engine refers to software, firmware, hardware, or other component that is used to effectuate a purpose. The engine will typically include software instructions that are stored in non-volatile memory (also referred to as secondary memory). When the software instructions are executed, at least a subset of the software instructions is loaded into memory (also referred to as primary memory) by a processor. The processor then executes the software instructions in memory. The processor may be a shared processor, a dedicated processor, or a combination of shared or dedicated processors. A typical program will include calls to hardware components (such as I/O devices), which typically requires the execution of drivers. The drivers may or may not be considered part of the engine, but the distinction is not critical.

In the example of FIG. 1, each of the engines can run on one or more hosting devices (hosts). Here, a host can be a computing device, a communication device, a storage device, or any electronic device capable of running a software component. For non-limiting examples, a computing device can be but is not limited to a laptop PC, a desktop PC, a tablet PC, an iPad, an iPod, an iPhone, an iTouch, a Google Android device, a PDA, or a server machine that is a physical or virtual server and hosted in an Internet public or private data center by a service provider or a third party for the service provider, or inside an enterprise's private data center or office premises. A storage device can be but is not limited to a hard disk drive, a flash memory drive, or any portable storage device. A communication device can be but is not limited to a mobile phone.

In the example of FIG. 1, each of the VMR engine 102, global infrastructure engine 104, and user experience engine 106 has one or more communication interfaces (not shown), which are software components that enable the engines to communicate with each other following certain communication protocols, such as the TCP/IP protocol, over one or more communication networks. Here, the communication networks can be but are not limited to, the Internet, an intranet, a wide area network (WAN), a local area network (LAN), a wireless network, Bluetooth, WiFi, and mobile communication networks. The physical connections of the network and the communication protocols are well known to those of skill in the art.

FIG. 2 depicts a flowchart of an example of a process to support operations of a VMR across multiple standard and proprietary video conference systems. Although this figure depicts functional steps in a particular order for purposes of illustration, the process is not limited to any particular order or arrangement of steps. One skilled in the relevant art will appreciate that the various steps portrayed in this figure could be omitted, rearranged, combined and/or adapted in various ways.

In the example of FIG. 2, the flowchart 200 starts at block 202 where a plurality of video conference feeds from a plurality of video conference endpoints, each associated with one of a plurality of participants, are accepted to a virtual meeting room (VMR). The flowchart 200 continues to block 204 where, for each of the plurality of participants to the VMR, the plurality of video conference feeds are converted and composed into a composite video and audio stream compatible with the video conference endpoint associated with the participant. The flowchart 200 continues to block 206 where a multi-party video conferencing session is enabled in real time among the plurality of participants, wherein the plurality of video conference endpoints are of different types. The flowchart 200 ends at block 208 where the composite audio and video stream is rendered to each of the plurality of participants to the VMR for an enhanced user experience.
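The flow of FIG. 2 can be read as one composition loop per participant. The following is a minimal sketch of that loop; the `Feed` record and `compose_for` helper are illustrative assumptions, not components named in this disclosure:

```python
from dataclasses import dataclass

@dataclass
class Feed:
    participant: str
    protocol: str      # e.g., "H.323", "SIP", or proprietary such as "Skype"
    video_codec: str   # codec the endpoint can receive, e.g., "H.264"
    audio_codec: str   # e.g., "G.711", "SILK"

def compose_for(receiver: Feed, feeds: list) -> dict:
    # Blocks 204-208: every other participant's feed is decoded, composed
    # into one layout, and re-encoded to match the receiver's own endpoint.
    sources = [f for f in feeds if f.participant != receiver.participant]
    return {
        "to": receiver.participant,
        "video_codec": receiver.video_codec,  # match the receiving endpoint
        "audio_codec": receiver.audio_codec,
        "tiles": [f.participant for f in sources],
    }

# Block 202: feeds accepted from endpoints of different types into one VMR
feeds = [Feed("alice", "H.323", "H.264", "G.711"),
         Feed("bob", "Skype", "VP8", "SILK")]
for f in feeds:
    print(compose_for(f, feeds))  # one composite per participant (block 208)
```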

Virtual Meeting/Media Room (VMR)

In the example of FIG. 1, VMR engine 102 allows participants to a video conference room to participate via all types of video conferencing endpoints. VMR engine 102 coalesces the video conferencing feeds across the vagaries of different manufacturers' video equipment and/or software implementations of video conferencing systems and endpoints in real time in order to effectively handle a multi-party video conferencing session. More specifically, VMR engine 102 converts and composes in real time the plurality of video conference feeds from the participants to a VMR into a composite video and audio stream compatible with each of the video conference endpoints, i.e., the video conference system associated with each of the participants to the VMR. Here, the conversion of the video conference feeds covers at least one or more of the following areas of the video conference feeds:

-   Video encoding formats (e.g., H.264, proprietary, etc.)
-   Video encoding profiles and levels (e.g., H.264 Main Profile, H.264 Constrained Baseline Profile, etc.)
-   Audio encoding formats (e.g., SILK, G7xx, etc.)
-   Communication protocols (e.g., H.323, SIP, XMPP, proprietary, etc.)
-   Video resolutions (e.g., QC/SIF, C/SIF, Q/VGA, High Definition—720p/1080p, etc.)
-   Screen ratios (e.g., 4:3, 16:9, custom, etc.)
-   Bitrates for audio streams (e.g., narrowband, wideband, etc.)
-   Bitrates for video streams (e.g., 1.5 Mbps, 768 kbps, etc.)
-   Encryption standards (e.g., AES, proprietary, etc.)
-   Acoustic considerations (e.g., echo cancellation, noise reduction, etc.)

The technologies involved for the conversion of the video conference feeds include but are not limited to, transcoding, upscaling, downscaling, transrating, mixing video and audio streams, adding and removing video, audio and other multimedia streams, noise reduction and automatic gain control (AGC) of the video conference feeds; a sketch of how these per-endpoint differences might drive the conversion follows.
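As a rough illustration, the sketch below compares two hypothetical endpoint capability records across the dimensions listed above and derives the operations the VMR would need. The capability fields and values are assumptions chosen for illustration, not values from this disclosure:

```python
# Hypothetical per-endpoint capability records (illustrative only).
CAPS = {
    "room_system": {"video": "H.264", "resolution": "720p", "ratio": "16:9",
                    "audio": "G.722", "video_kbps": 1500},
    "skype_client": {"video": "VP8", "resolution": "VGA", "ratio": "4:3",
                     "audio": "SILK", "video_kbps": 768},
}

def conversions_needed(src: str, dst: str) -> list:
    """List the per-dimension operations so a feed from `src` can be
    rendered on `dst` (transcode, scale, transrate)."""
    a, b, ops = CAPS[src], CAPS[dst], []
    if a["video"] != b["video"]:
        ops.append(f"transcode video {a['video']} -> {b['video']}")
    if a["audio"] != b["audio"]:
        ops.append(f"transcode audio {a['audio']} -> {b['audio']}")
    if a["resolution"] != b["resolution"] or a["ratio"] != b["ratio"]:
        ops.append(f"scale {a['resolution']}/{a['ratio']} -> "
                   f"{b['resolution']}/{b['ratio']}")
    if a["video_kbps"] > b["video_kbps"]:
        ops.append(f"transrate {a['video_kbps']} -> {b['video_kbps']} kbps")
    return ops

print(conversions_needed("room_system", "skype_client"))
```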

In some embodiments, VMR engine 102 will facilitate point-to-point calls between two disparate endpoints such as a Skype client and a standards-based H.323 endpoint, wherein the Skype user initiates a call to another user with no knowledge of the other user's endpoint technology and a VMR is automatically provisioned between the two parties after determining the translation needed between the two endpoints. In this case the VMR is used to allow the two endpoints to communicate without the users needing to know or worry about the differences between the protocols, video encodings, audio encodings or other technologies used by the endpoints.

In some embodiments, VMR engine 102 composes and renders the composite video and audio stream to closely match the capabilities of the video conferencing endpoint associated with each of the participants in order for the participant to have an effective meeting experience. When compositing the frames of the video and audio stream for final rendition, VMR engine 102 may take into consideration the innovative video layouts of the participants as well as the activity of various participants to the video conference. For a non-limiting example, VMR engine 102 may give more prominence to the active speaker at the conference relative to other participants. In some embodiments, VMR engine 102 may also accommodate multimedia data streams/content accompanying the video conference feeds as part of the composite audio/video stream for collaboration, wherein multimedia data streams may include but are not limited to, slides sharing, whiteboards, video streams and desktop screens. Chat-style messages for real-time communication amongst the participants are also supported. Participant status information, including but not limited to the type of endpoint used, signal quality received and audio/video mute status, could also be displayed for all of the participants.

In some embodiments, media processing node 300 is designed to convert and compose several video conference feeds of video and audio streams in real time to create and render one or more composite multimedia streams for each participant to the VMR. As shown in the example depicted in FIG. 3, media processing node 300 may include as its components one or more of: video compositor 302, video transcoder 304, distributed multicast video switch 306, audio transcoder/pre-processor 308, distributed multicast audio mixer 310, protocol connector 312, and a distributed conference session controller 314. In the case of video, the video streams from the participants are made available at the media processing node 300 in three (or more) forms:

original compressed video

uncompressed raw video

a lower resolution compressed thumbnail video

In the example of FIG. 3, video compositor 302 of media processing node 300 subscribes to whichever video stream it needs based on the set of videos needed to be composed and rendered to the participants. The two (or more) compressed forms of the video streams listed above are transcoded by video transcoder 304 and sent by distributed multicast video switch 306 using a multicast address on the network so that other (remote) media processing nodes that want these video streams can subscribe to them as needed. This scheme allows the entire cluster of nodes (locally and globally) to share and/or exchange the audio and video streams they need in the most efficient manner. These streams could be transmitted over the public Internet, over a private network or over a provisioned overlay network with service level guarantees. Using this approach, video compositor 302 may show various composites, including but not limited to, just the active speaker, two people side-by-side if they are having a conversation, and any other custom format as requested by a participant, which may include transformations of the video into other representations as well.

In the example of FIG. 3, video transcoder 304 of media processing node 300 encodes and decodes composite video streams efficiently, where characteristics of each individual stream can be extracted during decoding. Here, video transcoder 304 gathers knowledge provided by the coded bitstream of the composite video streams, wherein such gathered information includes but is not limited to:

Motion vectors (MVs)

Coded block pattern (cbp) and skip flags

Static macro blocks (zero motion vector and no cbp)

Quantization values (Qp)

Frame rate

These characteristics are used to build up a metadata field associated with the uncompressed video stream as well as a synthetic stream of compressed or otherwise transformed data.
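One plausible shape for such a metadata field, sketched as Python data classes; the field names and units are assumptions, while the "static" test follows the zero-MV/no-cbp definition given above:

```python
from dataclasses import dataclass, field

@dataclass
class MacroBlockMeta:
    """Per-macro-block hints extracted from the coded bitstream."""
    mv: tuple             # motion vector, e.g., in quarter-pel units
    cbp: int              # coded block pattern; 0 means no residual
    skipped: bool         # skip flag from the bitstream
    qp: int               # quantization value used by the sender

    @property
    def is_static(self) -> bool:
        # "static" per the text above: zero motion vector and no cbp
        return self.mv == (0, 0) and self.cbp == 0

@dataclass
class StreamMeta:
    """Metadata field carried alongside the uncompressed video stream."""
    frame_rate: float
    mbs: list = field(default_factory=list)  # list of MacroBlockMeta
```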

In some embodiments, video compositor 302 not only composes the raw video streams into a composite video stream but also builds up a composite metadata field in order to apply similar operations (including both 2D and 3D operations) outlined in the metadata field to the individual video streams of the composite video. For a non-limiting example, motion vectors need to be applied with the same transformation that video compositor 302 may apply to each raw video stream, including but not limited to, scaling, rotation, translation and shearing. This metadata could be used for other non-real-time multimedia services, including but not limited to recorded streams and annotated streams for offline search and indexing.

FIGS. 4(a)-(b) depict diagrams of examples of media encoding under a simple dyadic scenario where the input is scaled down by a factor of two, where FIG. 4(a) illustrates how macro blocks (MBs) are treated and FIG. 4(b) illustrates how motion vectors are processed. Here, video compositor 302 does its best to align the raw videos on macro block boundaries for best results. For that purpose, layouts used by video compositor 302 are chosen judiciously to minimize the number of macro blocks covered by a video stream while maintaining the target size of the video. Optionally, video compositor 302 may mark the boundary areas of the video feeds as such, since these areas typically contain less information in the context of video conferencing. The metadata field could include information such as the positions of speakers in the video stream, who can then be segmented and compressed separately. The composite metadata field is then processed to provide meaningful information on a macro block basis and to best match the encoding technology. For non-limiting examples (see the motion vector sketch after the list):

-   In the case of H.264, the processing takes into account the fact that macro blocks can be subdivided down to 4×4 sub-blocks.
-   In the case of H.263, the macro blocks cannot be subdivided, or can only be subdivided into 8×8 blocks, depending on the annexes used.
-   In the case of H.261, the macro blocks are not subdivided.
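For the dyadic 2x downscale of FIGS. 4(a)-(b), four source macro blocks collapse into one destination macro block, so their motion vectors must be combined and rescaled into the new coordinate space. A minimal sketch, assuming a simple average-then-halve policy (an assumption; a real encoder must also round to legal MV precision for the target codec):

```python
def downscale_mv_dyadic(mvs: list) -> tuple:
    """Combine the motion vectors of the source macro blocks that collapse
    into one destination macro block under a 2x downscale: average them,
    then halve them to match the new coordinate space."""
    n = len(mvs)  # normally 4 in the dyadic case
    avg_x = sum(mv[0] for mv in mvs) / n
    avg_y = sum(mv[1] for mv in mvs) / n
    return (round(avg_x / 2), round(avg_y / 2))

# Four co-located source MBs -> one destination MB
print(downscale_mv_dyadic([(8, 4), (8, 4), (6, 4), (10, 4)]))  # (4, 2)
```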

In some embodiments, video transcoder 304 is given the composite raw video and composite metadata field, and then uses the knowledge provided by the metadata field to reduce computation and focus on meaningful areas. For non-limiting examples (sketched in code after the list):

-   Skip macro block detection: extremely fast skip macro block detection can be achieved by choosing skip automatically if the composite metadata points to a static MB.
-   MV search range: the search range of the MV can be dynamically adapted based on the composite metadata field information. The search range is directly evaluated based on the MV of the matching MB in the metadata field.
-   MV predictor: the MV indicated in the composite metadata field is used as a primary predictor during motion estimation.
-   Quantization: the quantization value (Qp) used during encoding is bounded by the value provided in the composite metadata field.
-   Frame rate adaptation: areas of the composite with a lower frame rate are marked as skipped when no update for that frame is given.
-   Areas of the composite with no motion get fewer bits.
-   Areas on the border of each video are encoded with fewer bits.
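A sketch of how these shortcuts might look per macro block, using a minimal stand-in for the metadata record sketched earlier; the search-range heuristic and the dict-shaped "decision" are illustrative assumptions:

```python
from dataclasses import dataclass

@dataclass
class MBMeta:                      # minimal stand-in for MacroBlockMeta above
    mv: tuple                      # motion vector from the metadata field
    qp: int                        # sender's quantization value
    is_static: bool                # zero MV and no cbp

def encode_mb(meta: MBMeta, target_qp: int, default_search: int = 32) -> dict:
    # Skip MB detection: trust the metadata for static macro blocks.
    if meta.is_static:
        return {"mode": "skip"}
    return {
        "mode": "inter",
        # MV predictor: the metadata MV is the primary predictor.
        "mv_predictor": meta.mv,
        # MV search range: sized from the known MV instead of a full search.
        "search_range": min(default_search,
                            max(abs(meta.mv[0]), abs(meta.mv[1])) + 4),
        # Quantization: never spend more bits than the source stream did,
        # i.e., the output Qp is bounded below by the metadata Qp.
        "qp": max(target_qp, meta.qp),
    }

print(encode_mb(MBMeta(mv=(0, 0), qp=28, is_static=True), target_qp=26))
print(encode_mb(MBMeta(mv=(12, -3), qp=30, is_static=False), target_qp=26))
```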

In the case of audio, audio transcoder/pre-processor 308 mixes each participant's audio stream along with the audio streams from other participants received at media processing node 300 through distributed multicast audio mixer 310. The mixed output can also be sent over the network via distributed multicast audio mixer 310 so that all other nodes that want to receive this stream may subscribe to it and mix it in with the local streams on their media processing nodes. Such an approach makes it possible for the global infrastructure engine 104 to provide mixed audio output to all participants in a video conference hosted at a VMR in a distributed manner.

In some embodiments, audio transcoder/pre-processor 308 enables highly scalable audio mixing that mixes audio signals from various codecs at the best sampling rate, as shown in the example depicted in the diagram of FIG. 5. More specifically, audio transcoder/pre-processor 308 first determines the best possible sampling rate at which to mix audio based on the video endpoints associated with the participants in a particular VMR. Audio transcoder/pre-processor 308 then estimates the noise on each incoming channel and determines the voice activity on each channel. Only active channels are mixed, to eliminate noise in the VMR, and the channels are equalized to boost signal and reduce noise. Finally, audio transcoder/pre-processor 308 mixes the full channels by normalization of the channels and creates a unique stream for each participant based on all the other audio streams in the VMR, to eliminate echo on the channel.
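A compact sketch of this mix-minus pipeline, assuming an energy-threshold voice activity detector and simple peak normalization; the threshold value and the 16 kHz mixing rate are illustrative assumptions:

```python
import numpy as np

def mix_streams(channels: dict, noise_floor: float = 1e-4) -> dict:
    """Each participant receives the sum of all *active* channels except
    their own, so noise-only channels are dropped and nobody hears their
    own echo (the unique per-participant stream described above)."""
    # Crude energy-based voice activity detection per channel.
    active = {name: pcm for name, pcm in channels.items()
              if float(np.mean(pcm ** 2)) > noise_floor}
    outputs = {}
    for name in channels:
        others = [pcm for other, pcm in active.items() if other != name]
        if others:
            mixed = np.sum(others, axis=0)
            peak = max(float(np.max(np.abs(mixed))), 1.0)
            outputs[name] = mixed / peak       # normalize to avoid clipping
        else:
            outputs[name] = np.zeros_like(channels[name])
    return outputs

# One 20 ms frame at a common 16 kHz mixing rate
t = np.linspace(0, 0.02, 320, endpoint=False)
frames = {"alice": 0.5 * np.sin(2 * np.pi * 440 * t),   # speaking
          "bob": 0.001 * np.random.randn(320)}          # background noise only
out = mix_streams(frames)
print({k: round(float(np.max(np.abs(v))), 3) for k, v in out.items()})
```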

In some embodiments, audio transcoder/pre-processor 308 enables real-time language translation and other speech-to-text or speech-to-visual services, including but not limited to language-to-English translation and subtitling in real time, interacting with and modifying content in the call through voice commands, and pulling in data in real time from the Internet through speech-to-visual services.

Since bad or non-existent echo cancellation on the part of one or more participants in a conference call often deteriorates the whole conference for everyone in the VMR, in some embodiments, audio transcoder/pre-processor 308 enables automatic Internet/cloud-based determination of the need for acoustic echo cancellation at a video conference endpoint, as shown in the example depicted in the diagram of FIG. 6. First, audio transcoder/pre-processor 308 determines the roundtrip delay on the audio stream going out of the MCU to the endpoint and back. Audio transcoder/pre-processor 308 then estimates the long-term and short-term power of the speaker signal going out of the MCU and of the microphone signal coming back. The natural loss of the endpoint can then be calculated by the following formula:

ERL = 10 log₁₀(power(Speaker Signal)/power(Microphone Signal)) dB

If the natural loss of the endpoint is greater than 24 dB, there is no need to do echo cancellation, as the endpoint is taking care of echo itself.
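A sketch of this cloud-side decision, under the assumption that the roundtrip delay has already been measured in samples and using the power-ratio form of ERL given above:

```python
import numpy as np

def needs_echo_cancellation(speaker: np.ndarray, mic: np.ndarray,
                            roundtrip_samples: int,
                            threshold_db: float = 24.0) -> bool:
    """Engage the cloud echo canceller only when the endpoint's natural
    loss (ERL) is at or below the 24 dB threshold described above."""
    aligned_mic = mic[roundtrip_samples:]          # undo the measured delay
    spk = speaker[:len(aligned_mic)]
    p_spk = float(np.mean(spk ** 2)) + 1e-12       # speaker signal power
    p_mic = float(np.mean(aligned_mic ** 2)) + 1e-12
    erl_db = 10.0 * np.log10(p_spk / p_mic)        # echo return loss in dB
    return bool(erl_db <= threshold_db)

# Endpoint that attenuates its echo by ~30 dB: no cancellation needed.
spk = np.random.randn(16000)
mic = np.concatenate([np.zeros(160), 0.03 * spk])  # 160-sample roundtrip
print(needs_echo_cancellation(spk, mic, roundtrip_samples=160))  # False
```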

Distributed Infrastructure

Traditional approaches to building an infrastructure for video conferencing that meets these requirements would often demand custom hardware that uses FPGAs (Field Programmable Gate Arrays) and DSPs (Digital Signal Processors) to deliver low-latency media processing, and chain the hardware together to handle the large load. Such a customized hardware system is not very flexible in the A/V formats and communication protocols it can handle, since the hardware logic and DSP code are written and optimized for a specific set of A/V codecs and formats. Such a system can also be very expensive to build, requiring large R&D teams and multi-year design cycles with specialized engineering skills.

Supporting the operations of the VMR engine 102 in FIG. 1 requires a multi-protocol video bridging solution, known in the industry as an MCU (Multipoint Control Unit), as the media processing node 300 discussed above to process and compose video conference feeds from various endpoints. Traditionally, an MCU is built with custom hardware that ties together hundreds of DSPs with special-purpose FPGAs, resulting in large MCUs with many boards of DSPs in expensive bladed rack-mounted systems. Even with such expensive systems, it is only possible to connect tens or hundreds of participants to an MCU when the participants use HD video. To achieve larger scale, a service provider has to buy many such bladed boxes and put some load balancers and custom scripting together. But such an approach is expensive, hard to manage, hard to program in terms of the DSP software and FPGA code used, and hard to distribute seamlessly across the globe. Additionally, such a system usually runs a proprietary OS that makes it hard to add third-party software and, in general, to provide new features rapidly, and some of the functionality, such as the ability to do flexible compositing for participants in a virtual room when the room spans multiple MCUs, is lost.

In the example of FIG. 1, global infrastructure engine 104 enables efficient and scalable processing and compositing of media streams by building the MCUs acting as the media processing nodes for video stream processing from off-the-shelf components, such as Linux/x86 CPUs and PC GPUs (Graphics Processing Units), instead of custom hardware. These MCUs can be deployed in a rack-and-stack cloud-computing style and hence achieve the most scalable and cost/performance-efficient approach to supporting the VMR service. The x86 architecture has improved vastly over the last five years in its Digital Signal Processing (DSP) capabilities. Additionally, off-the-shelf Graphics Processing Units (GPUs) used for rendering PC graphics can be used to augment the processing power of the CPU.

In the example of FIG. 1, global infrastructure engine 104, which supports and enables the operations of the VMRs, has at least one or more of the following attributes:

Ability to support a wide variety of audio and video formats and protocols;

Scalable mixing and composition of the audio and video streams;

Service delivered across the globe with minimized latency;

Capital efficient to build and cost efficient to operate.

In some embodiments, global infrastructure engine 104 enables clustering of the x86 servers both locally on a LAN as well as across geographies as the media processing nodes 300 for the MCUs to achieve near unlimited scaling. All of the media processing nodes 300 work together as one giant MCU. In some embodiments, such clustered MCU design makes use of network-layer multicast and a novel multi-bit-rate stream distribution scheme to achieve the unlimited scaling. Under such design, global infrastructure engine 104 is able to achieve great scalability in terms of the number of participants per call, geographic distribution of callers, as well as distribution of calls across multiple POPs worldwide.

In some embodiments, global infrastructure engine 104 distributes the MCUs around the globe in Points of Presence (POPs) at third-party data centers to process video conference feeds coming from video conference endpoints having different communication protocols. Each POP has as much processing power (e.g., servers) as required to handle the load from the geographical region in which the POP is located. Users/participants connecting to the video conference system 100 are directed by the global infrastructure engine 104 to the closest POP (the “connector”) so as to minimize their latency. Once the participants reach the POP of the global infrastructure engine 104, their conference feeds of audio and video streams can be carried on a high-performance network between the POPs. Such distributed infrastructure of global infrastructure engine 104 enables the biggest media processing engine (VMR engine 102) ever built to act as one single system 100. Such a system would take a lot of capital costs, R&D costs and an immense amount of operations scripting and coordination if it were to be built using the traditional approach of DSP/FPGA-based custom hardware.

FIGS. 7(a)-(c) depict an example of a multi-phased media stream distribution process achieved locally on a LAN present in each POP or across multiple POPs on the WAN (Wide Area Network). FIG. 7(a) depicts Phase I of media stream distribution—single-node media distribution within a POP, where video conference feeds from participants to a video conference, via, for non-limiting examples, room systems running H.323, PCs running H.323 and PCs running Skype, all connect to one node in a POP based on proximity to the conference host, and where the video conference feeds are load balanced but not clustered among nodes in the POP. FIG. 7(b) depicts Phase II of media stream distribution—clustered-node media distribution within a POP, wherein video conference feeds from the participants are load balanced among a cluster of nodes at the POP, and the audio/video streams are distributed/overflowed among the nodes in the POP. FIG. 7(c) depicts Phase III of media stream distribution—complete media distribution both among the cluster of nodes within the POP and among different POPs as well, where some participants to the conference may connect to their closest POPs instead of a single POP.

In some embodiments, the global infrastructure engine 104 may allow multiple other globally distributed private networks to connect to it, including but not limited to deployments of videoconferencing services such as Microsoft Lync that require federation (i.e., cooperation among multiple organizational entities) at edge nodes and translation and decoding of several communication and transport protocols.

In some embodiments, global infrastructure engine 104 may limit the video conference feed from every participant to a video conference to go through a maximum of two hops of media processing nodes and/or POPs in the system. However, it is possible to achieve other types of hierarchy with intermediate media processing nodes that do transcoding or transcode-less forwarding. Using this scheme, global infrastructure engine 104 is able to provide pseudo-SVC (Scalable Video Coding) to participants associated with devices that do not support SVC, i.e., each of the participants to the video conference supports AVC (Advanced Video Coding) with appropriate bit-rate upshift/downshift capabilities. The global infrastructure engine 104 takes these AVC streams and adapts them to multi-bit-rate AVC streams inside the media distribution network. Under this scheme, it is still possible to use SVC on the devices that support SVC. It is also possible to use SVC on the internal network instead of multi-bit-rate AVC streams, as such a network adapts and grows as the adoption of SVC by the client devices of the participants increases.
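One way to read the pseudo-SVC scheme is as tier selection inside the distribution network; a minimal sketch, with an illustrative three-rate ladder that is an assumption rather than a rate set from this disclosure:

```python
def pick_tier(receiver_kbps: int,
              tiers_kbps: tuple = (1500, 768, 384)) -> int:
    """The sender contributes one AVC stream, the network fans it out at
    several rates, and each AVC-only receiver is shifted up or down
    between tiers as its estimated bandwidth changes."""
    for rate in tiers_kbps:              # ladder ordered high -> low
        if rate <= receiver_kbps:
            return rate
    return tiers_kbps[-1]                # floor: lowest available tier

print(pick_tier(1000))  # -> 768
print(pick_tier(200))   # -> 384 (best effort at the floor)
```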

FIG. 8 depicts examples of software components of global infrastructure engine 104 to support the operation of the VMR. Some of these components, which include but are not limited to, a media gateway engine; a media processing engine for transcoding, compositing, mixing and echo cancellation among H.26x, G.7xx, and SILK; multi-protocol connectors among H.323, Skype, SIP, XMPP, and NAT traversal; and Web applications such as conference control, screen and presentation sharing, chat, etc., are distributed across the nodes and POPs of the global infrastructure engine 104 for real-time communication. Some components, which include but are not limited to, user/account management, the billing system, and NOC (Network Operation Center) systems for bootstrapping, monitoring, and node management, are run at one or more centralized but redundant management nodes. Other components, which include but are not limited to, the common application framework and platform (e.g., Linux/x86 CPUs, GPUs, package management, clustering), can be run on both the distributed nodes and the centralized management nodes.

Whenever input is accepted over an open network to a service, especially from un-trusted sources, strong validation must occur in order to prevent security breaches, denial-of-service attacks, and general instability of the service. In the case of video conferencing, the audio/video stream input from a conferencing endpoint that needs to be validated may include control protocol messages and compressed media streams, both of which must be validated. While it is important that the code handling the un-trusted input be responsible for doing validation and sanity checks before allowing it to propagate through the system, history has shown that relying on this as the only validation strategy is insufficient. For a non-limiting example, H.323 heavily utilizes Abstract Syntax Notation One (ASN.1) encoding, most public ASN.1 implementations have had several security issues over the years, and ASN.1's complexity makes it nearly impossible to hand-code a parser that is completely secure. For another non-limiting example, many implementations of H.264 video decoders do not contain bounds checks for performance reasons, and instead contain system-specific code to restart the codec when it has performed an invalid memory read and triggered a fault.

In some embodiments, global infrastructure engine 104 provides a high-level mechanism for fault-tolerant protocol handling to prevent improper input from causing instability and possible security breaches via protocol connector 312, as illustrated by the example depicted in FIG. 9. All code that processes protocol control messages and compressed audio and video streams is isolated in one or more separate, independent, unprivileged processes. More specifically (a process-isolation sketch follows the list):

-   Separate processes: each incoming connection should cause a new process to be created by protocol connector 312 to handle it. This process should be responsible for translating the incoming control messages into internal API calls and decompressing the incoming media stream into an internal uncompressed representation. For a non-limiting example, inbound H.264 video can be converted into YUV420P frames before being passed on to another process. The goal should be that if this process crashes, no other part of the system will be affected.
-   Independent processes: each connection should be handled in its own process. A given process should only be responsible for one video conference endpoint, so that if the process crashes, only that single endpoint will be affected and everyone else in the system will not notice anything.
-   Unprivileged processes: each process should be as isolated as possible from the rest of the system. To accomplish this, ideally each process runs with its own user credentials, and may use the chroot() system call to make most of the file system inaccessible.
-   Performance considerations: the fact that protocol connector 312 may introduce several processes where typically only one exists brings about the possibility of performance degradation, especially in a system handling audio and video streams where a large amount of data needs to be moved between processes. To that end, shared memory facilities can be utilized to reduce the amount of data that needs to be copied.
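A minimal process-isolation sketch along the lines of the list above, assuming a POSIX system; the function and parameter names are illustrative, the decode loop is elided, and chroot()/setuid() require the parent to start with root privileges:

```python
import os
import sys

def handle_connection_isolated(conn_fd: int, jail_dir: str,
                               unpriv_uid: int, unpriv_gid: int):
    pid = os.fork()                  # separate process: one per connection
    if pid != 0:
        return pid                   # parent keeps accepting connections
    try:
        os.chroot(jail_dir)          # make most of the filesystem invisible
        os.chdir("/")
        os.setgid(unpriv_gid)        # drop group first, then user privileges
        os.setuid(unpriv_uid)        # unprivileged from here on
        # ... translate control messages into internal API calls and
        # decompress media (e.g., H.264 -> YUV420P frames) on conn_fd here;
        # a crash in this child takes down only this endpoint's leg.
        sys.exit(0)
    except Exception:
        sys.exit(1)                  # never let a fault escape the child
```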

In some embodiments, global infrastructure engine 104 supports distributed fault-tolerant messaging for an Internet/cloud-based client-server architecture, wherein the distributed fault-tolerant messaging provides one or more of the following features (a queuing sketch follows the list):

-   Ability to direct unicast, broadcast, multicast and anycast traffic with both reliable and unreliable delivery mechanisms.
-   Ability to load balance service requests across media processing nodes and context-sensitive or free server classes.
-   Synchronous and asynchronous delivery mechanisms with the ability to deliver messages in spite of process crashes.
-   Priority-based and temporal-order delivery mechanisms, including atomic broadcasts using efficient fan-out techniques.
-   Ability to implement an efficient fan-out using a single write and atomic broadcast.
-   Ability to selectively discard non-real-time queued messages, improving real-time responsiveness.
-   Priority-based queuing mechanism with the ability to discard non-real-time events not delivered.
-   A transaction-aware messaging system.
-   Integration with a hierarchical entity naming system based on conference rooms, IP addresses, process names, pids, etc.
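A sketch of just one of these features, the selective discard of stale non-real-time messages, using a priority queue; the one-second staleness bound and the message shapes are assumptions for illustration:

```python
import heapq
import itertools
import time

class RTQueue:
    """Priority queue that drops non-real-time messages once they go stale,
    so real-time traffic stays responsive."""
    def __init__(self, max_age_s: float = 1.0):
        self._heap = []
        self._seq = itertools.count()    # tie-breaker keeps temporal order
        self._max_age_s = max_age_s

    def put(self, priority: int, msg, realtime: bool = True):
        # Lower number = higher priority; timestamp recorded for staleness.
        entry = (priority, next(self._seq), time.monotonic(), realtime, msg)
        heapq.heappush(self._heap, entry)

    def get(self):
        now = time.monotonic()
        while self._heap:
            _, _, ts, realtime, msg = heapq.heappop(self._heap)
            if not realtime and now - ts > self._max_age_s:
                continue                 # silently discard stale bulk traffic
            return msg
        return None

q = RTQueue()
q.put(0, "audio frame")                   # real-time, high priority
q.put(5, "stats report", realtime=False)  # discardable if it sits too long
print(q.get(), q.get())
```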

Traditionally, legacy video endpoints of the video conference participants, such as video conferencing endpoints using the H.323 protocols, typically communicate with other endpoints within the LAN of a corporate or organizational network. There have been attempts made to enable the H.323 endpoints to seamlessly communicate with endpoints outside the corporate network through firewalls—some of which have been standardized in the form of ITU protocol extensions to H.323, namely H.460.17, 18, 19, 23, 24, while others have been advocated by video conferencing equipment vendors and include deploying gateway hardware or software within the DMZ of the corporate network. However, none of these attempts have been very successful, as evidenced by the fact that inter-organizational calls are still cumbersome and happen only with heavy IT support and involvement.

In some embodiments, global infrastructure engine 104 enables seamless firewall traversal for legacy video conference endpoints of the video conference participants to communicate with other endpoints. Since legacy video conferencing endpoints usually implement only standardized protocols that do not assume an Internet/cloud-based service being available, global infrastructure engine 104 utilizes at least one or more of the following techniques for firewall traversal, as illustrated by the example depicted in the diagram of FIG. 10 (a reachability-check sketch follows the list):

-   Restricting all video conference calls to be outbound calls from the endpoints behind a firewall going to a server in the global infrastructure engine 104 that is reachable on a public IP address on the Internet, which is accessible by every user. This avoids the double firewall issue between two corporate or organizational networks that makes inter-organization calls much harder.
-   Keeping the set of UDP/IP ports used to reach the global infrastructure engine 104, and the set of UDP/IP ports from which global infrastructure engine 104 distributes media, to a small named subset of ports. This allows corporations with restricted firewall policies to open their firewalls in a narrow scope versus making the firewall wide open.
-   Offering a simple web browser-based application that allows any user to easily run a series of checks to ascertain the nature and behavior of a corporation's firewall and to determine whether or not the firewall may be an issue with H.323 endpoints or whether any rule change would be needed.
-   Offering an enhanced browser-based application as a tunneling proxy that enables any user to run software in a browser or native PC OS to allow an endpoint to tunnel through the software to one or more public servers on the Internet. Alternately, the software can be run in a stand-alone manner on any PC or server on the network, in native or virtual machine form factor, to enable the same tunneling.
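The browser-based checks in the third item reduce, at their core, to probing outbound reachability of the small named port subset; a sketch, where the connector host and the port numbers are purely hypothetical:

```python
import socket

# Hypothetical public connector address and named port subset; the actual
# hosts and ports are deployment-specific and not given in this disclosure.
CONNECTOR_HOST = "connector.example.com"
SIGNALING_PORTS = (5060, 5061)

def outbound_reachable(host: str, ports: tuple,
                       timeout_s: float = 2.0) -> dict:
    """Try plain outbound TCP connections to the connector on each named
    port, which is all an endpoint behind a typical corporate firewall
    needs under the outbound-only calling model above."""
    results = {}
    for port in ports:
        try:
            with socket.create_connection((host, port), timeout=timeout_s):
                results[port] = True
        except OSError:
            results[port] = False
    return results

print(outbound_reachable(CONNECTOR_HOST, SIGNALING_PORTS))
```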

In the example of FIG. 1, user experience engine 106 renders multimedia content, including but not limited to the composite audio/video stream, to each of the participants to the VMR for an enhanced User Experience (UE) for the participants. The UE provided by user experience engine 106 to the participants to the VMR hosted by VMR engine 102 typically comprises one or more of the following areas:

-   Physical interaction with the video conference endpoint. User experience engine 106 enables controlling the setup and management of a multi-party video conferencing session in a VMR in a device/manufacturer-independent way. Most of the physical interaction with the manufacturer-supplied remote control can be subsumed by a web application, wherein the web application can be launched from any computing or communication device, including a laptop, smart phone or tablet device. In some embodiments, these interactions could also be driven through speech or visual commands that the Internet/cloud-based software recognizes and translates into actionable events.
-   User interface (UI) associated with a Web application that controls the participants' interactions with the VMR engine 102. Here, user experience engine 106 controls the interaction of the moderator and the conferencing participants. Through an intuitive UI provided by user experience engine 106, participants to the video conference can control features such as video layouts, muting participants, sending chat messages, sharing screens and adding third-party video content.
-   Video/Multimedia content. User experience engine 106 controls content rendered in the form of screen layouts, composite feeds, welcome banners, etc. during the video conference, as well as what the participants see when they log into a VMR, what they physically see on the screen, etc. In some embodiments, the UI and/or the multimedia content could contain information related to performance metrics for the participant's call experience, including but not limited to video resolution, video and audio bitrate, connection quality, packet loss rates for the connection, carbon offsets gained as a result of the call, transportation dollars saved and dollars saved in comparison to traditional MCU-based calls. This also allows for eco-friendly initiatives such as infrequent flyer miles or a similarly framed service for miles of travel saved, and the bragging rights associated with various status levels similar to frequent flyer status levels. Incentive programs can be based on attaining different levels of status, which encourages participants to use videoconferencing instead of traveling for a meeting. This gives them personal incentive to use videoconferencing over and above the business benefits.
-   Customization of the video conference session for a specific (e.g., vertical industry) application. User experience engine 106 allows customization of the VMR in order to tailor a video conference session to the needs of a particular industry, so that the conference participants may experience a new level of collaboration and meeting effectiveness. Such vertical industries or specialties include but are not limited to, hiring and recruiting, distance learning, telemedicine, secure legal depositions, shared viewing of real-time events such as sports and concerts, and customer support.
-   Personalization of the VMR as per the moderator's and/or the participants' preferences and privileges. User experience engine 106 provides the moderator the ability to personalize the meeting when scheduling a video conference. Examples of such customization include but are not limited to, the initial welcome banner, uploading of the meeting agenda, specifying the video layouts that will be used in the session, and privileges given to the session participants.

Despite the fact that most conventional video conference systems cost tens of thousands of dollars, they offer very limited freedom and flexibility to the call organizer or to any participants in terms of controlling the user experience during the call. The layouts come pre-configured to a select few options, and the settings that can be modified during a call are also limited.

In some embodiments, user experience engine 106 provides moderator-initiated in-meeting/session management and control over security and privacy settings during a particular video conference call, wherein such management and control features include but are not limited to, muting a particular speaker at the video conference, controlling and/or broadcasting layouts associated with one of the video conference endpoints to all or a subset of the participants, and sharing additional materials selectively with a subset of the participants (for a non-limiting example, in an HR vertical application where multiple interviewers are interviewing one candidate in a common call).

By offering the video conferencing service over the Internet/cloud, user experience engine 106 eliminates a lot of these limitations of conventional video conference systems. For a non-limiting example, user experience engine 106 enables participants associated with different types of video conference endpoints to talk to each other over the Internet during the video conference; for instance, participants from H.323 endpoints can talk to participants from desktop clients such as Skype, and both the moderator and the participants can choose from a wide variety of options. In addition, by providing the ability to terminate the service in the cloud, user experience engine 106 enables access to a much richer set of features for a conference call that a participant can use compared to a conventional passively bridged conference call. More specifically, every participant can have control of one or more of:

1.  Which active participants to the VMR to view in his/her video windows on the screen of his/her video conference endpoint.
2.  Layout options for how the different participants should be shown on the screen of his/her video conference endpoint.
3.  Layout options on where and how to view the secondary video channel (screen sharing, presentation sharing, shared viewing of other content) on the screen of his/her video conference endpoint.

Using such in-meeting controls, a moderator can control security and privacy settings for the particular call in ways that prior art does not allow, or does not provide for.

As shown in the example depicted in the diagram of FIG. 11, the moderator of the call, in addition to the aforementioned options, has a richer suite of options to pick from through a web interface to manage and control the video conference, which include but are not limited to (a control-message sketch follows the list):

1.  Muting subsets of participants during a call.
2.  Sharing content with subsets of participants during the course of a call.
3.  Prescribing a standard layout of the screen of his/her video conference endpoint and a set of displayed callers for other participants to see.
4.  Choosing to display caller-specific metadata on the respective video windows of a subset of the participants, including user-name, site name, and any other metadata.
5.  An easy and seamless way to add or remove participants from the video conference call through a real-time, dynamic web interface.
6.  An easily customizable welcome screen displayed to video callers on joining the call that can display information relevant to the call as well as any audio or video materials that the service provider or the call moderator wishes for the participants to see.
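These moderator options imply some control channel between the web interface and the VMR; the sketch below shows what such control messages might look like, with an entirely hypothetical JSON schema that is not part of this disclosure:

```python
import json

def moderator_command(action: str, targets: list, **params) -> str:
    """Build a moderator control message of the kind the web interface of
    FIG. 11 would send to the VMR (schema purely illustrative)."""
    return json.dumps({"action": action, "targets": targets, "params": params})

# Mute a subset of participants (item 1) and push a layout with
# caller-specific metadata (items 3 and 4).
print(moderator_command("mute", ["alice", "bob"]))
print(moderator_command("set_layout", ["*"], layout="active_speaker",
                        show_metadata=["user-name", "site-name"]))
```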

In some embodiments, user experience engine 106 enables private conferences in a VMR by creating sub-rooms in the main VMR that any subset of the participants to the main VMR can join for private chats. For a non-limiting example, participants can invite others for a quick audio/video or text conversation while being on hold in the main VMR.

A shared experience of events among participants to a video conference often requires all participants to be physically present at the same place. Otherwise, when it happens over the Internet, the quality is often very poor and the steps needed to achieve it are quite challenging for the average person to pursue as a viable technological option.

In some embodiments, user experience engine 106 provides collaborative viewing of events through VMRs that can be booked and shared among the participants so that they are able to experience the joy of simultaneously participating in an event and sharing the experience together via a video conference. For a non-limiting example, the shared event can be a Super Bowl game that people want to enjoy with friends, or a quick session to watch a few movie trailers together among a group of friends to decide which one to go watch in the theater.

In some embodiments, user experience engine 106 utilizes the MCUs of the global infrastructure engine 104 to offer an easy, quick, and high-quality solution for event sharing, as illustrated by the example depicted in the diagram of FIG. 12. More specifically, user experience engine 106 enables one initiating participant 1202 to invite a group of other participants 1204 for a shared video conference call in a VMR via a web interface 1206. Once everyone joins the VMR to share online videos and content, initiating participant 1202 can then present the link to the website where the content to be shared 1208 is located, and the content 1208 starts streaming into the same VMR directly from the content source, whether the content is co-located with the initiating participant 1202 or located on the Internet on a 3rd-party web site or content store. Participant 1202 may continue to have conversations with the other participants 1204 while watching this content 1210. Features that include but are not limited to, the layout of the content in terms of where it is visible, its audio level, whether it should be muted or not, and whether it should be paused or removed temporarily, are in the control of the person sharing the content 1210, similar to the management and control by a moderator of a video conference as discussed above. Such an approach provides a compelling and novel way to watch live events among groups of people whose locations are geographically distributed, yet who want to experience an event together. This enables a whole new set of applications around active remote participation in live professional events such as conferences and social events such as weddings.

In some embodiments, user experience engine 106 enables multiple views and device-independent control by the participants to the video conference. Here, the video endpoints each have their own user interface, and in the case of hardware video systems available in conference rooms, the video conference endpoints may each have a remote control that is not very easy to use. In order to make the user experience of connecting to the VMR simple, user experience engine 106 minimizes the operations that need to be carried out using the endpoint's native interface and moves all of those functions to a set of interfaces running on a device familiar to most users—desktop PC, laptop PC, mobile phone or mobile tablet—and thus makes the user experience of controlling the VMR mostly independent of the endpoint device's user interface capabilities. With such device-independent control of the video conference, user experience engine 106 provides flexibility, ease of use, richness of experience and feature expansion that make the experience far more personal and meaningful to participants.

In some embodiments, user experience engine 106 may also allow a participant to participate in and/or control a video conference using multiple devices/video conference endpoints simultaneously. On one device, such as the video conference room system, the participant can receive audio and video streams. On another device, such as a laptop or tablet, the same participant can send/receive presentation materials, chat messages, etc. and also use it to control the conference, such as muting one or more of the participants, changing the layout on the screens of the video conference endpoints with PIP for the presentation, etc. The actions on the laptop are reflected on the video conference room system since both are connected to the same VMR hosting the video conference.
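One plausible way to realize this multi-device behavior is to register each of a participant's devices against the same VMR session under a role, and to route control actions from the control device while the composite streams keep flowing to the audio/video device. The sketch below is an illustration under assumptions; VMRSession, the role names "av" and "control", and the method names are all hypothetical.

    from collections import defaultdict

    class VMRSession:
        """Sketch of one participant joined to a VMR from several devices."""

        def __init__(self) -> None:
            self.devices = defaultdict(dict)   # participant_id -> {role: device_id}

        def attach(self, participant_id: str, device_id: str, role: str) -> None:
            # role: "av" for the room system, "control" for the laptop/tablet
            self.devices[participant_id][role] = device_id

        def control_action(self, participant_id: str, action: str) -> str:
            # A control issued from the laptop/tablet (mute, layout change, ...)
            # shows up on the room system, since both legs sit in the same VMR.
            av_device = self.devices[participant_id].get("av", "<none>")
            return f"applied '{action}'; composite refreshed for AV device {av_device}"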

Joining a video conference from H323 endpoints today often involves cumbersome steps to be followed via a remote-control for the device. In addition to logistical issues such as locating the remote in the required room, there are learning-curve related issues in terms of picking the correct number to call from the directory, entering a specified code for the call from the remote, etc. Both endpoint participants as well as desktop participants are directly placed into the conference with their video devices turned on upon joining a call.

In some embodiments, user experience engine 106 offers radically new ways to improve and simplify this user experience by rendering to the participants welcome screen content that includes, but is not limited to, an interactive welcome handshake, a splash screen, interactions for entering room number related info, a welcome video, etc. for video conferences, as shown in the example depicted in the diagram of FIG. 14. To join a call from a video conference endpoint, all that the moderator needs to do is to call a personal VMR number he/she subscribes to. The moderator can then set up details for the call, including the rich media content that would form part of the welcome handshake with other participants, which may then be set up as default options for all calls hosted by the moderator. Other participants call into the VMR and enter the room number specified for the conference call. On joining the VMR, they first enjoy the rich media content set up as their welcome screen, including content specific to the call, such as the agenda, names of parties calling in, company related statistics, etc. Such content could also be more generic for non-business applications, including any flash content such as videos, music, animations, ads, etc. Upon joining the call, the display also shows a code that is specific to the participant on his/her screen, which can be used to add content to the call for content sharing. The code can also be entered from a web application used for the call, or can be driven through voice or visual commands that are recognized and processed by software in the internet cloud and then translated into actionable events.

FIG. 13 depicts an example of one way in which a laptop or mobile phone could be associated with a conference room system, wherein the participant uses an H/W room conferencing system 1002 and dials out to a well-known VMR 1004 using the directory entry on the remote. Once connected, user experience engine 106 plays back a welcome screen, along with a "session id" associated with this leg of the conference. The participant goes to a web application 1006 or mobile application 1008 and enters this session id into the application along with the meeting number of the VMR that he/she wishes to join, which places the participant into the VMR. Alternatively, a participant can join in a video conference hosted in a VMR via one of the following ways (a sketch of the session-id pairing flow appears after the list):

-   Using touch tone on the conference room system
-   Controlling via voice recognition once dialed in
-   Playing recognizable music or sound patterns from the laptop into the conference room system
-   Showing some gestures or patterns to the camera of the conference room once it is connected.
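The session-id pairing flow of FIG. 13 can be sketched as a small service that hands a short code to the dialed-in room system and later redeems that code, together with a meeting number, to transfer the room system's leg into the target VMR. All names below (PairingService, welcome, join) are hypothetical, and the id format is an arbitrary choice.

    import secrets

    class PairingService:
        """Minimal sketch of the session-id pairing flow of FIG. 13."""

        def __init__(self) -> None:
            self.pending = {}   # session_id -> leg_id of the dialed-in room system

        def welcome(self, leg_id: str) -> str:
            # Shown on the welcome screen of the room system's leg.
            session_id = secrets.token_hex(3)   # e.g. a short 6-hex-digit code
            self.pending[session_id] = leg_id
            return session_id

        def join(self, session_id: str, meeting_number: str) -> str:
            # Entered from the web/mobile app; a wrong id raises KeyError.
            leg_id = self.pending.pop(session_id)
            return f"leg {leg_id} transferred into VMR {meeting_number}"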

The experience as described above also provides the opportunity not to have audio or video streams turned on by default for any of the participants. When all participants are settled in and the call is ready to start, the moderator could enable this globally, and each participant may have fine-grained control over whether to turn on/off their audio/video as well. In some embodiments, this also allows for monetizable services to be provided while participants wait, such as streaming advertisements that are localized to the participant's region, timezone, demographics and other characteristics as determined by the service in the internet cloud. In other embodiments, while people wait for the call to begin, videos about new features introduced in the service could be shown, while in some other embodiments, detailed information about participants on the call could be shown in multimedia-rich formats that were not possible in the prior art.

Currently, consumers who wish to organize a video call have only two sets of options available: either choose a business/professional option that uses H323 endpoints such as Polycom or Tandberg systems, or use limited functionality/quality desktop applications that show them postage-stamp quality video of participants, typically on a simple or bland background or interface.

To address this situation, user experience engine 106 provides personalizable VMRs to allow participants to customize or personalize his/her conference experience on a per call basis, as depicted by the example of the diagram of FIG. 15, which dramatically transforms, revolutionizes, and democratizes the video conference experience. For business users, user experience engine 106 provides the layout and background for the call resembling a conference room, or a similar professional setting, and different types of backgrounds, welcome music, welcome banner and other status and chat messages and tickers during the call.

The moderator for the call can pick this from a suite of options provided to him/her based on the subscription plan. For retail consumers, the experience would be a lot more informal and transformative. The caller could decorate his/her room in any way he/she prefers. The participant's personal website for the VMR could be similarly decorated and customized. During the course of the call, user experience engine 106 may extract or read in these customizable options specified by the different participants and place them in this customized VMR so that the experience is much more enriched than a conventional call.

Offering such a personalized conference service over the Internet/cloud has the distinct advantage of removing the need for processing and computing capabilities at any of the endpoint equipment. As long as the endpoints are able to receive and process encoded video streams, user experience engine 106 is able to provide any level of media-rich content to the participants as part of their call experience, all of which can be controlled and set up by the moderator of the VMR.

For a two-person conversation, instead of a traditional flat layout showing both parties side by side, user experience engine 106 may present a 3D layout where the input videos from both participants are positioned to make it look as if they are looking at each other, so that other participants at the video conference see the conversation happen more naturally. Similarly, for a non-traditional application such as remote medicine or a conference call where patients can talk to doctors remotely, the conference itself could be made to resemble a doctor's office. Patients could watch a video together about some health related issue while they wait for the doctor, and once the doctor calls in, the experience could simulate a virtual doctor's office visit. Other applications include, but are not limited to, scenarios such as recruiting, where the call could have its own customized layout and look-and-feel, with the resume of the interviewee being visible to the interviewers in their video, where it can be edited and annotated by the interviewers but hidden from the interviewee.

A "meet-me" service such as the one described herein preserves the anonymity of callers by allowing them to call in from any software or hardware endpoint without the callee finding any personally identifiable information about the caller.

A significant pain point for Web users currently is the lack of convenient and complete solutions for collaborating remotely. There are many scenarios in which users need to share what is currently on their screen (a drawing, a video, or the current state of their machine during online troubleshooting, to name a few) with a remote user. The only way to do this currently is to be signed in to a desktop client that supports screen sharing and to ask a contact for permission to start sharing. If one does not have such a desktop client, or the person one wishes to share a screen with is not a contact on that client, this method fails. Moreover, these solutions are not available on mobile phones and other small-screen devices.

In some embodiments, user experience engine 106 creates a single online "home" for personalized sharing of one's desktop, laptop and/or mobile screen with other video conference endpoints. As discussed herein, screen sharing refers to the act of seeing a remote machine's screen, displaying one's own screen to another/remote video conference endpoint, or both, in a streaming fashion. Some minor variants of this include allowing the remote user to see only parts of one's screen, giving them the additional ability to interact with one's screen, etc. For non-limiting examples, user experience engine 106 provides one or more of the following features for screen sharing:

-   Addressable in a personalized, consistent way over HTTP or HTTPS. For a non-limiting example as shown in FIG. 16, a user of the service would be allotted a URL of the form http://myscre.en/joeblow to serve as a persistent access link to the user's screen, whenever the user is sharing one or more of his/her screens. The user can then share this URL with his/her friends, colleagues, on their online profiles on social networks, etc. Here, the URL can be a so-called TinyURL, which is a shortened URL (typically <10 characters, including domain ending) that can serve as an easy shorthand for a location on the web.
-   Access to one's screen sharing URL will be customizable, with a default option to only be available when the user is actively choosing to share his/her screen. Moreover, a combination of participant passcodes, timed screen share sessions and IP address filtering options are provided to the user to ensure maximum control over the people the user shares his/her screens with (a sketch of these checks follows the list).
-   While in the screen share mode, participants will be shown the list of available screens and can choose to view one or more of these. Depending on the host's permission settings, they may also be given remote access to interact with the screen being shared.
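The access controls named in the second bullet (passcode, timed session, IP filtering) compose naturally into a single guard function. Below is a minimal sketch, assuming a hypothetical share object with attributes active, passcode, expires_at, and allowed_networks; an empty network allow-list is treated here as "allow all", which is one possible policy choice.

    import ipaddress
    import time

    def may_view_screen(share, passcode: str, viewer_ip: str, now=None) -> bool:
        """Sketch of the access checks: passcode, timed session, IP filtering."""
        now = time.time() if now is None else now
        if not share.active:                    # default: only while actively sharing
            return False
        if share.passcode and passcode != share.passcode:
            return False
        if share.expires_at is not None and now > share.expires_at:
            return False                        # timed screen share session expired
        if share.allowed_networks:              # IP address filtering
            ip = ipaddress.ip_address(viewer_ip)
            return any(ip in ipaddress.ip_network(net) for net in share.allowed_networks)
        return True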

Companies such as Skype have created browser plug-ins that allow a participant to easily call any number that is displayed in his/her browser with one click, by showing a "Skype phone" logo next to any number displayed in the browser and routing these calls through a Skype desktop client. On the other hand, users today commonly have their online contacts in one of a few stores: Google contacts, Exchange, Yahoo contacts, Skype, Facebook, etc. While these contacts can be interacted with in different ways from within native applications (for a non-limiting example, hovering over a Google contact gives users a menu with options to mail or IM that contact), there is no simple pervasive way to offer one-click video call functionality across different contact protocols similar to what Skype does for numbers.

In some embodiments, user experience engine 106 supports web browser and/or desktop plug-ins to enable intelligent one-click video conference calls to participants (as opposed to numbers) on VMR contacts from recognized protocols of the video conference endpoints. As used herein, a plug-in refers to a small piece of software that extends the capabilities of a larger program, and is commonly used in web browsers and desktop applications to extend their functionality in a particular area.

In some embodiments, user experience engine 106 creates plug-ins that offer such functionality wherever contacts from recognized protocols are displayed in the browser (e.g., Gmail, Y! Mail, etc.) and/or desktop applications (MS Outlook, Thunderbird, etc.). As shown in the example of FIG. 17, the one-click video conference call plug-in offered by user experience engine 106 has at least the following features (a sketch of the per-contact logic follows the list):

1.  A user has to agree to install the plug-in(s) and the applications in which they will be active.
2.  For enabled applications, every contact from a recognized protocol (tentatively, Exchange and Google contacts) has a "video call" logo next to it. For a non-limiting example, if the sender of a mail in a user's Exchange mailbox was on a recognized protocol, the display interface of the mailbox is enhanced with a video call logo and a small arrow to show more options.
3.  Clicking on the logo launches a video conference call via a VMR between the user and that contact, with an appropriate choice of endpoints for either end.
4.  Clicking on the arrow provides users with the complete list of ways in which they can interact with this contact via the VMR service, including audio calls, scheduling a future call, etc.
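The per-contact decision in features 2 and 3 reduces to checking the contact's protocol against the recognized set and, on a match, attaching a call action. The sketch below is illustrative only; the field names, the recognized-protocol set, and the vmr:// URL scheme are assumptions, and a real plug-in would perform this decoration inside the browser or mail client rather than in standalone Python.

    # Tentative set, per feature 2 above; a real plug-in would load this from config.
    RECOGNIZED_PROTOCOLS = {"exchange", "google"}

    def decorate_contact(contact: dict) -> dict:
        """If the contact's protocol is recognized, attach a 'video call' action
        that would launch a VMR call between the user and that contact."""
        if contact.get("protocol") in RECOGNIZED_PROTOCOLS:
            contact = dict(contact, video_call_url="vmr://call?to=" + contact["address"])
        return contact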

In some embodiments, user experience engine 106 performs automatic video gain control when some rooms of the video conference call are too bright and cause distraction. Similar to AGC (Auto Gain Control) in audio systems, the brightness of all rooms of the video endpoints that are part of the video conference can be adjusted to give a semblance that the conference is happening in the same place. Optionally, automatic video gain control can be turned on by a participant to the conference who feels that the brightness of one or more of the rooms is disturbing.
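Video AGC of this kind can be approximated by scaling each room's luma toward a common mean brightness. The following is a minimal sketch assuming 8-bit luma frames supplied as NumPy arrays, one per endpoint; frame acquisition, color handling, and temporal smoothing of the gain are all omitted.

    import numpy as np

    def normalize_brightness(frames: dict, target=None) -> dict:
        """Scale each room's luma so all feeds converge on a common mean
        brightness. `frames` maps endpoint id -> 8-bit luma array (assumption)."""
        means = {k: float(f.mean()) for k, f in frames.items()}
        if target is None:
            target = sum(means.values()) / len(means)   # common brightness level
        out = {}
        for k, f in frames.items():
            gain = target / max(means[k], 1e-6)          # per-room video "gain"
            out[k] = np.clip(f.astype(np.float32) * gain, 0, 255).astype(np.uint8)
        return out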

In some embodiments, user experience engine 106 provides live information about the cost savings achieved by the ongoing video conference, such as miles saved and the cost of gas and hotel per call. Here, the distance between participants can be calculated based on geo-location of IP addresses, and mileage and federal mileage credits can be calculated based on the miles saved per call, in order to come up with a total dollar amount to be rendered on the screen. The carbon offsets that can be claimed can also be computed based on the locations of participants and the duration of the call, and displayed to the participants appropriately.
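The readout could be computed roughly as follows. Every rate in this sketch (mileage rate, hotel cost, CO2 per mile) is a placeholder rather than a figure from the patent, and the per-participant distances are assumed to come from IP geolocation as described above.

    def call_savings(distances_miles, mileage_rate=0.50,
                     hotel_per_night=120.0, kg_co2_per_mile=0.4):
        """Sketch of the live savings readout; all rates are placeholders.
        `distances_miles` lists each participant's distance to a notional
        meeting point, derived from IP geolocation."""
        miles_saved = 2 * sum(distances_miles)          # round trips avoided
        travel_cost = miles_saved * mileage_rate        # mileage-credit dollars
        # Assume a hotel night is avoided for anyone far enough to stay over.
        hotel_cost = hotel_per_night * sum(1 for d in distances_miles if d > 300)
        carbon_kg = miles_saved * kg_co2_per_mile       # claimable carbon offset
        return {"miles_saved": miles_saved,
                "dollars_saved": round(travel_cost + hotel_cost, 2),
                "carbon_offset_kg": round(carbon_kg, 1)}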

Virtual reality (VR) representations today straddle a spectrum from preconfigured avatars that can be picked from, to static images that can be uploaded and animated to a limited extent. In a multiparty video call setting, there is no way either to relocate the participants into a virtual world while keeping their personas real-world, or to transplant them into a VR world while "avatarizing" their personas at the same time.

In some embodiments, user experience engine 106 presents photo-realistic virtual reality (VR) events to participants to a video conference through the MCUs of the global infrastructure engine 104 to resolve both issues discussed above. Like a conventional video conference call, the VMR takes in the input audio/video streams from the different participants' cameras, composites them together into one, and encodes and sends the composite video to each participant separately. When the participants desire a VR version of an event, user experience engine 106 takes one or more of the additional steps as follows to deliver such a VR experience to the participants, as depicted in the diagram of the example of FIG. 18 (a code skeleton of this pipeline follows the list):

1.  Image detection and segmentation component 1802 receives the input video from each participant.
2.  Segmentation component 1802 detects and extracts out a participant from the background of the video stream and provides metadata about his/her location and other features in the video stream.
3.  User experience engine 106 then animates the participant via virtual reality rendering component 1804 by adding various characteristics to the face of the participant, or by transforming the face by applying an image transformation algorithm. It may perform further analysis, face and feature detection, and fully animate the face of the participant to create a semi-animated version of the face itself.
4.  Video compositor 302 within media processing node 300 then replaces the extracted background with the background overlaid with the VR-rendered (animated) participant and provides the video stream to the other participants.

With such an approach, user experience engine 106 is able to capture and transform the input video and audio streams from different participants in different customizable ways to achieve different user experiences.
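The four steps can be strung together as a simple per-participant pipeline. In the skeleton below, segmenter, renderer, and compositor stand in for components 1802, 1804, and 302 respectively; their signatures are assumptions made for illustration.

    def vr_composite(input_videos, segmenter, renderer, compositor):
        """Skeleton of the four VR steps above, with hypothetical callables:
        segmenter(frame) -> (person, background, metadata);
        renderer(person, metadata) -> animated person;
        compositor(background, person) -> output frame."""
        out = {}
        for participant, frame in input_videos.items():
            person, background, meta = segmenter(frame)        # steps 1-2: detect & extract
            avatar = renderer(person, meta)                    # step 3: animate/transform
            out[participant] = compositor(background, avatar)  # step 4: re-composite
        return out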

In some embodiments, user experience engine 106 may extract all participants out of the environments of their video streams, and then add them back into a common environment to be sent together as one video stream. For a non-limiting example, different participants calling in from different geographical locations can all be made to look like they are seated across from each other at a conference table and having a conversation.

Providing localized, real-time offering of ads related to services available to a user in a particular geographical area has immense market application and benefits. The few current solutions that exist rely heavily or purely on GPS-related information, or on a high-performance processor on the mobile device, to do the processing required to generate the information. In some embodiments, user experience engine 106 enables internet/cloud-based augmented-reality user interaction services via the MCUs of the global infrastructure engine 104. More specifically, user experience engine 106 analyzes the video stream captured from a participant/user's video conference endpoint (e.g., a cell phone camera) and provides an augmented-reality video feed back to the user with annotations on services available in the geographical area of the participant, such as local events, entertainment and dining options. All the user needs to do is to place a video call to a VMR while directing his/her camera towards the objects of his/her interest. As shown in the example depicted in the diagram of FIG. 19, user experience engine 106 and global infrastructure engine 104 take away any requirement for processing capability at the user's device by processing the received video in the cloud and analyzing it via image detection and segmentation component 1902 for, as non-limiting examples, billboards and identifiable landmarks, in order to check against the GPS information from location service database 1904, or against GPS information obtained from the user's device, to determine his/her whereabouts. User experience engine 106 then modifies the input video feed using the gathered geographical information of the user and overlays the video stream via metadata compositor 1906 with metadata such as the names of restaurants within walking distance and names of local entertainment options to generate an augmented reality feed for the user.

In the example of FIG. 19, image detection and segmentation component 1902 is at the core of this logic; it analyzes the input video and extracts zones of interest within the video. Location service database 1904 is populated with information about different zip-codes; it can take in GPS data and/or zipcodes as input and provide a rich set of data about the services in that area and any other offerings that would be of interest. Metadata compositor 1906 renders metadata in real time; it takes the inputs from image detection and segmentation component 1902 and location service database 1904 and overlays the input video feed with useful metadata about the surroundings as discussed above.
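Chaining the three components of FIG. 19 yields a per-frame flow like the following sketch, where segmenter, location_db, and compositor are hypothetical stand-ins for components 1902, 1904, and 1906, and their interfaces are assumed for illustration.

    def augment_frame(frame, gps_hint, segmenter, location_db, compositor):
        """Sketch of the FIG. 19 flow with hypothetical components:
        segmenter(frame) extracts zones of interest (billboards, landmarks),
        location_db resolves a position and lists nearby services, and
        compositor overlays the service names as metadata on the feed."""
        zones = segmenter(frame)                       # detection/segmentation (1902)
        where = gps_hint or location_db.locate(zones)  # fix position via GPS or landmarks
        services = location_db.lookup(where)           # restaurants, events, entertainment
        return compositor(frame, zones, services)      # metadata compositor (1906)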

In some embodiments, user experience engine 106 may provide a guided tour of the area as the user walks around it, where user experience engine 106 may pre-fill the screen with more information about the sights and sounds of the area. In some embodiments, user experience engine 106 may populate that information into the video as well and show pictures of friends who might be nearby, in order to tie this augmented reality service into existing services for locating friends within the same neighborhood.

In some embodiments, the augmented reality service provided by user experience engine 106 is customizable, not just at the time of installation of any software downloaded onto a mobile device of the user, but on a per-use basis. The user may use the augmented reality service for various purposes, for non-limiting examples, a 411 lookup at one instant, and immediately after that, the user may call in and get a virtual tour of a local tourist highlight. Soon after that, the user may ask for restaurant related information to go grab a meal. As more information becomes available on third party sites about each location, the user experience engine 106 provides a seamless way to tie up with each of those providers over the Internet/cloud to offer more current information to each user. Since such an approach is completely rack-and-stack, depending solely on the plan that a user chooses, the calls can be run through a system with more processing capabilities to extract and provide more information to the user, thus providing a full suite of pricing options depending on the feature set needed by each user.

In some embodiments, user experience engine 106 supports real-time, translator-free multimedia communications during a live video conference by translating between different languages in real-time in the cloud, so that participants to the VMR could speak to each other in different languages and still carry on an intelligent conversation. More specifically, the real-time cloud-based translation may include, but is not limited to, one or more of the following options (a sketch of per-participant routing follows the list):

-   Real voice plus subtitles in one common language, e.g., a videoconference where different speakers could be speaking in different languages and the translation/subtitling gets done seamlessly in the cloud;
-   Translation from speech-to-visual for speech-initiated services such as search and location-based services;
-   Translated voices in a language that each participant can select for him/herself;
-   Same language as what the speaker is speaking, but a choice of different voices to replace the speaker's voice;
-   Simultaneous delivery of multimedia addresses/sessions in different languages to different users through the cloud;
-   Applications other than audio/video transmission, such as real-time translation of documents/inputs from the format of the sender to any supported format that any receiver chooses to receive data in. The conversion happens in real-time in the cloud.

Given the latency built into videoconferencing, a service like this would do away with the need to have human translators when two or more parties are communicating in different languages.
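A cloud-side router for the per-participant translated-voices option might look like the sketch below, where translate and synthesize stand in for cloud translation and text-to-speech services; the data shapes and all names are assumptions made for illustration.

    def route_audio(speech_segments, preferences, translate, synthesize):
        """Sketch of per-participant language selection. `speech_segments`
        is a list of (speaker, source_language, text) tuples; `preferences`
        maps each participant to {"language": ..., "voice": ...}."""
        feeds = {p: [] for p in preferences}
        for speaker, src_lang, text in speech_segments:
            for listener, pref in preferences.items():
                if listener == speaker:
                    continue   # speakers hear the other parties, not themselves
                translated = translate(text, src_lang, pref["language"])
                # Optionally re-voice the translated text in the chosen voice.
                feeds[listener].append(synthesize(translated, voice=pref.get("voice")))
        return feeds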

One embodiment may be implemented using a conventional general purpose or a specialized digital computer or microprocessor(s) programmed according to the teachings of the present disclosure, as will be apparent to those skilled in the computer art. Appropriate software coding can readily be prepared by skilled programmers based on the teachings of the present disclosure, as will be apparent to those skilled in the software art. The invention may also be implemented by the preparation of integrated circuits or by interconnecting an appropriate network of conventional component circuits, as will be readily apparent to those skilled in the art.

One embodiment includes a computer program product which is a machine readable medium (media) having instructions stored thereon/in which can be used to program one or more hosts to perform any of the features presented herein. The machine readable medium can include, but is not limited to, one or more types of disks including floppy disks, optical discs, DVD, CD-ROMs, micro drive, and magneto-optical disks, ROMs, RAMs, EPROMs, EEPROMs, DRAMs, VRAMs, flash memory devices, magnetic or optical cards, nanosystems (including molecular memory ICs), or any type of media or device suitable for storing instructions and/or data. Stored on any one of the computer readable medium (media), the present invention includes software for controlling both the hardware of the general purpose/specialized computer or microprocessor, and for enabling the computer or microprocessor to interact with a human viewer or other mechanism utilizing the results of the present invention. Such software may include, but is not limited to, device drivers, operating systems, execution environments/containers, and applications.

The foregoing description of various embodiments of the claimed subject matter has been provided for the purposes of illustration and description. It is not intended to be exhaustive or to limit the claimed subject matter to the precise forms disclosed. Many modifications and variations will be apparent to the practitioner skilled in the art. Particularly, while the concept "interface" is used in the embodiments of the systems and methods described above, it will be evident that such a concept can be interchangeably used with equivalent software concepts such as class, method, type, module, component, bean, object model, process, thread, and other suitable concepts. While the concept "component" is used in the embodiments of the systems and methods described above, it will be evident that such a concept can be interchangeably used with equivalent concepts such as class, method, type, interface, module, object model, and other suitable concepts. Embodiments were chosen and described in order to best describe the principles of the invention and its practical application, thereby enabling others skilled in the relevant art to understand the claimed subject matter, the various embodiments, and the various modifications that are suited to the particular use contemplated.

1.-31. (canceled)
 32. A video-conferencing system comprising: one or more media processing nodes, each node configured to: accept a plurality of audio and video streams from a plurality of endpoints in a video conference, at least some of the endpoints being associated with videoconferencing services that are incompatible with each other; and for each of the endpoints in the video conference: convert in real-time the plurality of audio and video streams into one or more composite audio and video streams compatible with the endpoint; and provide the one or more composite audio and video streams to the endpoint for rendering to one or more participants to the video conference that are associated with the endpoint.
 33. The system of claim 32, wherein the one or more media processing nodes are multi-party conferencing units built from off-the-shelf components.
 34. The system of claim 32, wherein each of the plurality of endpoints is associated with a videoconferencing service that is one of standards-based or proprietary.
 35. The system of claim 32, wherein the video streams comprise at least one of original compressed video, uncompressed raw video, and low resolution compressed thumbnail video.
 36. The system of claim 32, wherein converting the plurality of audio and video streams comprises converting, into a form compatible with the endpoint, at least one of video encoding formats, audio encoding formats, communication protocols, video resolutions, screen ratios, encryption standards, and acoustic considerations of the audio and video streams.
 37. The system of claim 32, wherein converting the plurality of audio and video streams into the composite audio and video streams comprises at least one of transcoding, upscaling, and downscaling the audio and video streams.
 38. The system of claim 32, wherein at least one of the media processing nodes is further configured to subscribe to one or more audio and video streams required to construct composite audio and video streams that are compatible with an endpoint.
 39. The system of claim 32, wherein at least one of the media processing nodes is further configured to transfer composite audio and video streams to at least one other media processing node.
 40. The system of claim 32, wherein each of the media processing nodes is further configured to track video metadata associated with decoded video streams in order to apply operations based on the video metadata to individual video streams in the one or more composite video streams.
 41. The system of claim 32, wherein each of the media processing nodes is further configured to create a unique audio stream for each endpoint in the video conference based on audio streams associated with other endpoints, thereby reducing noise and echo in the composite audio streams.
 42. A method of operating a video conference, the method comprising: accepting, by a media processing node, a plurality of audio and video streams from a plurality of endpoints in a video conference, at least some of the endpoints being associated with videoconferencing services that are incompatible with each other; and for each of the endpoints in the video conference: converting in real-time the plurality of audio and video streams into one or more composite audio and video streams compatible with the endpoint; and providing the one or more composite audio and video streams to the endpoint for rendering to one or more participants to the video conference that are associated with the endpoint.
 43. The method of claim 42, wherein each of the plurality of endpoints is associated with a videoconferencing service that is one of standards-based or proprietary.
 44. The method of claim 42, wherein the video streams comprise at least one of original compressed video, uncompressed raw video, and low resolution compressed thumbnail video.
 45. The method of claim 42, wherein converting the plurality of audio and video streams comprises converting, into a form compatible with the endpoint, at least one of video encoding formats, audio encoding formats, communication protocols, video resolutions, screen ratios, encryption standards, and acoustic considerations of the audio and video streams.
 46. The method of claim 42, wherein converting the plurality of audio and video streams into the composite audio and video streams comprises at least one of transcoding, upscaling, and downscaling the audio and video streams.
 47. The method of claim 42, further comprising subscribing, by the media processing node, to one or more audio and video streams required to construct composite audio and video streams that are compatible with an endpoint.
 48. The method of claim 42, further comprising transferring, by the media processing node, composite audio and video streams to at least one other media processing node.
 49. The method of claim 42, further comprising tracking, by the media processing node, video metadata associated with decoded video streams in order to apply operations based on the video metadata to individual video streams in the one or more composite video streams.
 50. The method of claim 42, further comprising creating, by the media processing node, a unique audio stream for each endpoint in the video conference based on audio streams associated with other endpoints, thereby reducing noise and echo in the composite audio streams. 